Measuring Effectiveness of Text-Decorated HTML Tags in Web Document Clustering

Authors

  • Mark P. Sinka
  • David W. Corne
Abstract

Web document analysis, and its associated research, underpins much of what is referred to as web intelligence and the envisaged ‘semantic web’. A key issue in this field is how to encode a web document from the raft of potential document “features” without losing salient information. Current research almost always uses word-based feature vectors such as term frequency of specific words (TF) and/or variants such as normalised term frequency and TF*IDF. We explore the question of whether existing word-based term vectors can be usefully augmented by using text-decorated words delimited by the “” HTML tag. We measure the effectiveness of a feature vector by encoding documents from a benchmark set in terms of this feature vector, and then measuring the accuracy of an unsupervised clustering task using this encoding. A thorough investigation is performed, using a variety of parameter values, to explore whether any increase in accuracy is achieved over vectors constructed just from the plain document text. Tests on the BankSearch dataset showed 9 different parameter combinations (using the text-decorating tag words) that had an improved accuracy over the vectors obtained via the plain document text.
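The encoding idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the tag name (`b` below), the additive weighting scheme, the `weight` parameter, and the names `DecoratedTextParser` and `encode` are all assumptions made for illustration, since the abstract does not specify them. The sketch builds a plain term-frequency vector from all text in a page and then adds extra weight for words that appear inside the chosen decorating tag.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class DecoratedTextParser(HTMLParser):
    """Counts all words in a page, plus the words enclosed in one chosen tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag.lower()
        self.depth = 0              # nesting depth inside the chosen tag
        self.plain = Counter()      # term frequency over all text
        self.decorated = Counter()  # term frequency inside the chosen tag

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        words = re.findall(r"[a-z]+", data.lower())
        self.plain.update(words)
        if self.depth > 0:
            self.decorated.update(words)


def encode(html, tag="b", weight=2.0):
    """Return a TF vector augmented by weight * (TF inside the given tag).

    The tag and the additive weighting are hypothetical choices,
    not taken from the paper.
    """
    parser = DecoratedTextParser(tag)
    parser.feed(html)
    parser.close()
    vector = {term: float(count) for term, count in parser.plain.items()}
    for term, count in parser.decorated.items():
        vector[term] += weight * count
    return vector


page = "<p>loans and <b>mortgage</b> loans</p>"
plain_vector = encode(page, weight=0.0)      # plain-text baseline
augmented_vector = encode(page, weight=2.0)  # decorated words boosted
```

With `weight=0.0` the function reduces to the plain-text baseline, so the same clustering procedure can be run on both encodings and the accuracies compared across a range of weights.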


Similar resources

Web-Document Retrieval by Genetic Learning of Importance Factors for HTML Tags

In contrast to conventional documents, a Web document consists of a number of tags which provide hints on the structure of the documents. In this paper, we propose a Web-document retrieval method using the characteristics of HTML tags. This method learns the importance of tags from a training text set. We use a genetic algorithm for learning the importance weights. We also present a modified sim...


Feature Weighting Improvement of Web Text Categorization Based on Particle Swarm Optimization Algorithm

It is usually true that some structures, such as the title, can express the main content of texts, and these structures may have an influence on the effectiveness of text categorization. However, the most common feature weighting algorithm, term frequency-inverse document frequency (TF-IDF), does not consider the structural information of texts. To solve this problem, a new feature weighting al...


Web Classification Approach Using Reduced Vector Representation Model Based on Html Tags

Automatic web page classification plays an essential role in information retrieval, web mining and web semantics applications. Web pages have special characteristics (such as HTML tags, hyperlinks, etc.) that make their classification different from standard text categorization. Thus, when applied to web data, traditional text classifiers do not usually produce promising results. In this pap...


Web Content Extraction through Histogram Clustering

We describe a method to extract content text from diverse Web pages by using the HTML document’s Text-To-Tag Ratio (TTR) rather than specific HTML cues that are not constant across various Web pages. We describe how to compute the TTR on a line-by-line basis and then cluster the results into content and non-content areas. The resulting TTR-histogram is not easily clustered because of its one di...


Document Clustering Based on Ontology and a Fuzzy Approach

Data mining, also known as knowledge discovery in databases, is the process of discovering unknown knowledge from a large amount of data. Text mining applies data mining techniques to extract knowledge from unstructured text. Text clustering is one of the important techniques of text mining, which is the unsupervised classification of similar documents into different groups. The most important step...



Publication date: 2004